Abstract: Eventual consistency of replicated data supports concurrent updates, reduces 
latency and improves fault tolerance, but forgoes strong consistency. Accordingly, several 
cloud computing platforms implement eventually-consistent data types. 

The set is a widespread and useful abstraction, and many replicated set designs have been 
proposed. We present a reasoning abstraction, permutation equivalence, that systematizes 
the characterization of the expected concurrency semantics of concurrent types. Under 
this framework we present one of the existing conflict-free replicated data types, Observed- 
Remove Set. 

Furthermore, in order to decrease the size of meta-data, we propose a new optimization 
that avoids tombstones. This approach can be transposed to other data types, such as 
maps, graphs or sequences. 
Keywords: data replication, optimistic replication, commutative operations 



An optimized conflict-free replicated set 






1 Introduction 

Eventual consistency of replicated data supports concurrent updates, reduces latency and 
improves fault tolerance, but forgoes strong consistency (e.g., linearisability). Accordingly, 
several cloud computing platforms implement eventually-consistent replicated sets [TT]. 
Eventual consistency allows concurrent updates at different replicas, under the expectation 
that replicas will eventually converge [13]. However, solutions for addressing concurrent 
updates tend to be either limited or very complex and error-prone [7]. 

We follow a different approach: Strong Eventual Consistency (SEC) [9] requires a 
deterministic outcome for any pair of concurrent updates. Thus, different replicas can be updated 
in parallel, and concurrent updates are resolved locally, without requiring consensus. Some 
simple conditions (e.g., that concurrent updates commute with one another) are sufficient 
to ensure SEC. Data types that satisfy these conditions are called Conflict-Free Replicated 
Data Types (CRDTs). Replicas of a CRDT object can be updated without synchronization 
and are guaranteed to converge. This approach has been adopted in several works. 



The set is a pervasive data type, used either directly or as a component of more complex 
data types, such as maps or graphs. This paper highlights the semantics of sets under 
eventual consistency, and introduces an optimized set implementation, Optimized Observed 
Remove Set. 

2 Principle of Permutation Equivalence 

The sequential semantics of a set are well known, and are defined by individual updates, 
e.g., $\{true\}\; add(e)\; \{e \in S\}$ (in "{pre-condition} computation {post-condition}" notation), 
where S denotes its abstract state. However, the semantics of concurrent modifications is 
left underspecified or implementation-driven. 

We propose the following Principle of Permutation Equivalence [5] to express that 
concurrent behaviour conforms to the sequential specification: "If all sequential permutations 
of updates lead to equivalent states, then it should also hold that concurrent executions of 
the updates lead to equivalent states." It implies the following behavior, for some updates 
$u$ and $u'$: 

$$\{P\}\; u; u' \;\{Q\} \;\land\; \{P\}\; u'; u \;\{Q'\} \;\land\; Q \Leftrightarrow Q' \;\Rightarrow\; \{P\}\; u \parallel u' \;\{Q\}$$

Specifically for replicated sets, the Principle of Permutation Equivalence requires that 
$\{e \neq f\}\; add(e) \parallel remove(f)\; \{e \in S \land f \notin S\}$, and similarly for operations on different elements 
or idempotent operations. Only the pair $add(e) \parallel remove(e)$ is unspecified by the principle, 
since $add(e); remove(e)$ differs from $remove(e); add(e)$. Any of the following post-conditions 
ensures a deterministic result: 

$\{\bot_e \in S\}$ - Error mark 
$\{e \in S\}$ - add wins 
$\{e \notin S\}$ - remove wins 
$\{add(e) >_{CLK} remove(e) \Leftrightarrow e \in S\}$ - Last Writer Wins (LWW) 



RR n° 8083 






Bieniusa, Zawirski, Preguiça, Shapiro, Baquero, Balegas, Duarte 



Figure 1: Examples of anomalies and a correct design. Panels: (a) Dynamo shopping cart, 
(b) C-Set, (c) OR-Set. 



where <clk compares unique clocks associated with the operations. Note that not all con- 
currency semantics can be explained as a sequential permutation; for instance no sequential 
execution ever results in an error mark. 
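The principle can be checked mechanically on small cases. The following is a minimal sketch, not part of the original report: it models the sequential set as a Python frozenset, and the helper names (`add`, `remove`, `sequential_outcomes`, `permutation_equivalent`) are hypothetical, chosen only for illustration. It enumerates every sequential order of a group of updates and reports whether all orders agree.

```python
from itertools import permutations

def add(e):
    """Sequential update: insert e into the set state."""
    return lambda s: s | {e}

def remove(e):
    """Sequential update: delete e from the set state."""
    return lambda s: s - {e}

def sequential_outcomes(initial, updates):
    """Apply every permutation of `updates` to `initial`; collect the distinct final states."""
    outcomes = set()
    for order in permutations(updates):
        state = frozenset(initial)
        for u in order:
            state = u(state)
        outcomes.add(state)
    return outcomes

def permutation_equivalent(initial, updates):
    """True when all sequential orders agree, so the principle fixes the concurrent result."""
    return len(sequential_outcomes(initial, updates)) == 1

# add(e) || remove(f) with e != f: every order yields the same state.
print(permutation_equivalent({"f"}, [add("e"), remove("f")]))  # True
# add(e) || remove(e): the two orders disagree, so the outcome is left open.
print(permutation_equivalent(set(), [add("e"), remove("e")]))  # False
```

In the second case the two orders yield the empty set and {e}, which is why a design must pick one of the post-conditions listed above.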



3 A Review of Existing Replicated Set Designs 

In the past, several designs have been proposed for maintaining a replicated set. Most of 
them violate the Principle of Permutation Equivalence (Fig. 1). For instance, the Amazon 
Dynamo shopping cart [3] is implemented using a register supporting read and write 
(assignment) operations, offering the standard sequential semantics. When two writes occur 
concurrently, the next read returns their union. As noted by the authors themselves, in case 
of concurrent updates even on unrelated elements, a remove may be undone (Fig. 1(a)). 

Sovran et al. and Aslan et al. [Hill] propose a set variant, C-Set, where for each element 
the associated add and remove updates are counted. The element is in the abstraction if their 
difference is positive. C-Set violates the Principle of Permutation Equivalence (Fig. 1(b)). 
When delivering the updates to both replicas as sketched, the add and remove counts are 
equal, i.e., e is not in the abstraction, even though the last update at each replica is add(e). 
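The C-Set anomaly can be replayed with a small counter-based sketch. This is an illustration under stated assumptions, not the C-Set authors' code: the class `CSet` and its methods are hypothetical names, and we assume that a local add or remove issues just enough increments to flip membership at the source, which is one reading consistent with the final counts of +4 and -4 in Fig. 1(b).

```python
class CSet:
    """Counter-based set: per element, a count of adds and a count of removes."""

    def __init__(self):
        self.adds = {}     # element -> add increments applied so far
        self.removes = {}  # element -> remove increments applied so far

    def count(self, e):
        return self.adds.get(e, 0) - self.removes.get(e, 0)

    def contains(self, e):
        # The element is in the abstraction if the difference is positive.
        return self.count(e) > 0

    def apply(self, delta):
        """Apply a delta (element, add increments, remove increments)."""
        e, a, r = delta
        self.adds[e] = self.adds.get(e, 0) + a
        self.removes[e] = self.removes.get(e, 0) + r

    def add(self, e):
        # Assumption: issue just enough add increments to bring the count to 1.
        delta = (e, 1 - self.count(e), 0) if self.count(e) <= 0 else (e, 0, 0)
        self.apply(delta)
        return delta

    def remove(self, e):
        # Assumption: issue just enough remove increments to bring the count to 0.
        delta = (e, 0, self.count(e)) if self.count(e) > 0 else (e, 0, 0)
        self.apply(delta)
        return delta

r1, r2 = CSet(), CSet()
d1, d2 = r1.add("e"), r2.add("e")        # concurrent adds
r1.apply(d2); r2.apply(d1)               # synchronize: +2 -0 at both replicas
d3, d4 = r1.remove("e"), r2.remove("e")  # concurrent removes, each issues -2
d5, d6 = r1.add("e"), r2.add("e")        # concurrent adds, each issues +1
r1.apply(d4); r1.apply(d6)               # synchronize
r2.apply(d3); r2.apply(d5)
print(r1.adds["e"], r1.removes["e"], r1.contains("e"))  # 4 4 False
```

Both replicas converge, but to a state where e is absent although the last update at each replica was add(e).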



4 Add-wins Replicated Sets 

In Section 2 we have shown that when considering concurrent add and remove operations 
over the same element, one among several post-conditions can be chosen. Considering the 
case of add-wins semantics, we now recall [5] the CRDT design of an Observed Remove 
Set, or OR-Set, and then introduce an optimized design that preserves the OR-Set behaviour 
and greatly improves its space complexity. 
payload set E, set T 
  initial ∅, ∅ 
query contains (element e) : boolean b 
  let b = (∃n : (e, n) ∈ E) 
query elements () : set S 
  let S = {e | ∃n : (e, n) ∈ E} 
update add (element e) 
  prepare (e) 
    let n = unique() 
  effect (e, n) 
    E := E ∪ {(e, n)} \ T 
update remove (element e) 
  prepare (e) 
    let R = {(e, n) | ∃n : (e, n) ∈ E} 
  effect (R) 
    E := E \ R 
    T := T ∪ R 
compare (A, B) : boolean b 
  let b = ((A.E ∪ A.T) ⊆ (B.E ∪ B.T)) ∧ (A.T ⊆ B.T) 
merge (B) 
  E := (E \ B.T) ∪ (B.E \ T) 
  T := T ∪ B.T 

Figure 2: OR-Set: Add-wins replicated set 

These CRDT specifications follow a new notation with mixed state- and operation-based 
update propagation. Although the formalization of this mixed model, and the associated 
proof obligations that check compliance with CRDT requirements, are out of the scope of this 
report, the notation is easy to infer from the standard CRDT model [HI El HH]. 

System model synopsis: We consider a single object, replicated at a given set of 
processes/replicas. A client of the object may invoke an operation at some replica of its 
choice, which is called the source of the operation. A query executes entirely at the source. 
An update applies its side effects first to the source replica, then (eventually) at all replicas, 
in the downstream for that update. To this effect, an update is modeled as an update pair 
(p, u) that includes two operations such that p is a side-effect free prepare (-update) operation 
and u is an effect(-update) operation; the source executes the prepare and effect atomically; 
downstream replicas execute only the effect u. In the mixed state- and operation-based 
modelling, replica state can both be changed by applying an effect operation or by merging 
state from another replica of the same object. The monotonic evolution of replica states is 
described by a compare operation, supplied with each CRDT specification. 

4.1 Observed Remove Set 

Figure 2 shows our specification for an add-wins replicated set CRDT. In its payload, E 
holds the elements and T the tombstones; both are sets of pairs (element e, unique tag n). 
The call unique() returns a unique tag, the add effect carries e together with that tag, the 
remove prepare collects the pairs containing e, and the remove effect removes the pairs 
observed at the source. Its concurrent specification $\{P\}\; u_0 \parallel \dots \parallel u_{n-1} \;\{Q\}$ is 
defined as follows for each element e: 

• $(\forall i,\; u_i = remove(e)) \Rightarrow Q = (e \notin S)$ 

• $(\exists i : u_i = add(e)) \Rightarrow Q = (e \in S)$ 

To implement add-wins, the idea is to distinguish different invocations of add(e) by 
adding a hidden unique token n, and effectively storing a pair (e, n). A pair (e, n) is removed by 
adding it to a tombstone set. An element can always be added again, because the new pair 
(e, n') always uses a fresh token, different from the old one, $n' \neq n$. If the same element e is 
both added and removed concurrently, the update-prepare of remove concerns only observed 
pairs $(e, n_1), (e, n_2), \dots$ and not the concurrently-added unique pair (e, n'). Therefore the 
add wins by adding a new pair. We call this object an Observed Remove Set, or OR-Set. 
As illustrated in Figure 1(c), OR-Set is immune from the anomaly that plagues C-Set. 

Space complexity: The payload size of OR-Set is at any moment bounded by the 
number of all applied add (effect-update) operations. 
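The OR-Set of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name `ORSet`, the split into `*_prepare` and `*_effect` methods, and the process-wide counter standing in for unique() are all assumptions made for the example; a real deployment would need tags that are unique across machines.

```python
import itertools

class ORSet:
    """Sketch of the OR-Set payload: element pairs E and tombstones T."""

    _tags = itertools.count()  # hypothetical stand-in for unique()

    def __init__(self):
        self.E = set()  # pairs (element, tag) currently in the set
        self.T = set()  # tombstones: pairs that have been removed

    def contains(self, e):
        return any(x == e for (x, _) in self.E)

    def elements(self):
        return {x for (x, _) in self.E}

    def add_prepare(self, e):
        return (e, next(ORSet._tags))  # runs at the source only

    def add_effect(self, pair):
        # E := (E ∪ {(e, n)}) \ T, so an add delivered after its remove stays removed
        self.E = (self.E | {pair}) - self.T

    def remove_prepare(self, e):
        return {(x, n) for (x, n) in self.E if x == e}  # pairs observed at the source

    def remove_effect(self, R):
        self.E -= R
        self.T |= R

    def merge(self, other):
        self.E = (self.E - other.T) | (other.E - self.T)
        self.T = self.T | other.T

r1, r2 = ORSet(), ORSet()
a = r1.add_prepare("e"); r1.add_effect(a); r2.add_effect(a)
# Concurrently: replica 1 removes e while replica 2 adds e again.
R = r1.remove_prepare("e"); r1.remove_effect(R)
b = r2.add_prepare("e"); r2.add_effect(b)
# Cross-deliver the two effects.
r2.remove_effect(R); r1.add_effect(b)
print(r1.contains("e"), r2.contains("e"))  # True True
```

The remove only covers the pair it observed, so the concurrently added pair survives at both replicas, while the tombstone set T keeps growing with every removed pair.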



4.2 Optimized Observed Remove Set 

The OR-Set design makes extensive use of unique identifiers and tombstones, as do other CRDTs 
[BJ 1141 [S] . We now show how to make CRDTs practical by minimizing the required 
meta-data. 

Immediately discarding tombstones: When comparing two payloads P and P', 
one containing some element e and the other not, it is important to know if e has 
been recently added to P, or if it was recently removed from P'. The presented add-wins set 
uses tombstones to unambiguously answer this question, even when updates are delivered 
out of order or multiple times. 

Tombstones accumulate (as a consequence of the monotonic semilattice requirement); if 
they cannot be discarded, memory requirements grow with the number of operations. To 
address this issue, Wuu's 2P-Set [TS] garbage-collects tombstones that have been delivered 
everywhere, basically by waiting for acknowledgements from each process to every other 
process. This adds communication and processing overhead, and requires all processes to be 
correct. We devise a novel technique to eliminate tombstones without these limitations and 
offer conflict-free semantics at an affordable cost. We present our solution using add-wins 
as the example. 

To recapitulate, in OR-Set, adding an element e adds a new unique pair (e, n) to 
the E part of the payload. Removing an element moves all pairs containing e observed 
at the source from E to T (see footnote 1). Note that adding some pair (e, n) always happens-before 
removing the same pair (e, n). If updates are delivered exactly once and in causal order, the add 
always executes before any related removes, and the tombstone set T is not necessary when 
executing operations. However, we also need to support state-based merge, which joins two 
replicas possibly unrelated by happens-before. When merging two replicas in which only 

1 A practical implementation will just set a mark bit on the representation of the removed pair and will 
deallocate any other associated storage. Consider for instance the extension of OR-Set to a map: a key will 
have some associated value, e.g., E would contain triples (e, n, value). When the key is removed, value can 
be discarded, but the corresponding (e, n) pair(s) must remain in T. 
payload set E, vect v          -- E: elements, set of triples (element e, timestamp c, replica i) 
                               -- v: summary (vector) of received triples 
  initial ∅, [0, ..., 0] 
query contains (element e) : boolean b 
  let b = (∃c, i : (e, c, i) ∈ E) 
query elements () : set S 
  let S = {e | ∃c, i : (e, c, i) ∈ E} 
update add (element e) 
  prepare (e) 
    let r = myID()             -- r: source replica 
    let c = v[r] + 1 
  effect (e, c, r) 
    pre causal delivery 
    if c > v[r] then 
      let O = {(e, c', r) ∈ E | c' < c} 
      v[r] := c 
      E := E ∪ {(e, c, r)} \ O 
update remove (element e) 
  prepare (e)                  -- Collect all unique triples containing e 
    let R = {(e, c, i) ∈ E} 
  effect (R)                   -- Remove triples observed at source 
    pre causal delivery 
    E := E \ R 
compare (A, B) : boolean b 
  let R = {(c, i) | 0 < c ≤ A.v[i] ∧ ∄e : (e, c, i) ∈ A.E} 
  let R' = {(c, i) | 0 < c ≤ B.v[i] ∧ ∄e : (e, c, i) ∈ B.E} 
  let b = A.v ≤ B.v ∧ R ⊆ R' 
merge (B) 
  let M = (E ∩ B.E) 
  let M' = {(e, c, i) ∈ E \ B.E | c > B.v[i]} 
  let M'' = {(e, c, i) ∈ B.E \ E | c > v[i]} 
  let U = M ∪ M' ∪ M'' 
  let O = {(e, c, i) ∈ U | ∃(e, c', i) ∈ U : c < c'} 
  E := U \ O 
  v := [max(v[0], B.v[0]), ..., max(v[n], B.v[n])] 

Figure 3: Optimized OR-Set (Opt-OR-Set). 



one replica has some pair (e, n), we need to know if the pair has been added to the replica 
that contains it or if it was removed in the other replica. 

We leverage these observations to propose a novel remove algorithm that discards a 
removed pair immediately and works safely with merge. It compactly records 
happens-before information to summarize removed elements. Figure 3 presents Optimized OR-Set 
(Opt-OR-Set) based on this approach. 

Each replica i maintains a vector v [5] to summarize the unique identifiers it has already 
observed. Entry v[j] = n at replica i indicates that this replica has observed n successive 
identifiers generated at j: (1, j), (2, j), . . . , (n, j). Replica i maintains its local counter as 
the i-th entry in the vector v[i], initially 0. A replica generates new unique identifiers (c, i) 
by incrementing its local counter. Note that to summarize successive identifiers in a vector, 
Opt-OR-Set requires causal delivery of updates (see footnote 2). 

When add is invoked, the source associates it with a unique identifier made of the next 
local counter value and source replica identifier. When the add is delivered to a downstream 
replica, it should have an effect only if it has not been previously delivered; for this, it checks 
if the unique identifier is incorporated in the downstream replica's vector. When merging 
payloads, an element should be in the merged state only if: either it is in both payloads (set 
M in Figure 3), or it is in the local payload and not recently removed from the remote one 
(set M'), or vice-versa (set M''). An element has been removed from a replica if it is not in 
the replica's payload but its identifier is reflected in the replica's vector. 
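The add, remove and merge rules of Figure 3 can be sketched as follows. This is an illustrative reading of the specification, not the authors' implementation: the class name `OptORSet`, the use of integer replica identifiers 0..n-1 as vector indices, and the `*_prepare` / `*_effect` method split are assumptions, and causal delivery of effects is assumed rather than enforced.

```python
class OptORSet:
    """Sketch of Opt-OR-Set: triples (element, counter, replica) plus a summary vector."""

    def __init__(self, replica_id, n_replicas):
        self.r = replica_id        # hypothetical integer replica identifier
        self.E = set()             # triples (e, c, i); no tombstone set
        self.v = [0] * n_replicas  # v[i]: highest counter observed from replica i

    def contains(self, e):
        return any(x == e for (x, _, _) in self.E)

    def add_prepare(self, e):
        return (e, self.v[self.r] + 1, self.r)

    def add_effect(self, triple):
        e, c, r = triple
        if c > self.v[r]:  # not delivered before (causal delivery assumed)
            older = {(x, c2, i) for (x, c2, i) in self.E
                     if x == e and i == r and c2 < c}
            self.v[r] = c
            self.E = (self.E | {triple}) - older  # coalesce repeated adds

    def remove_prepare(self, e):
        return {t for t in self.E if t[0] == e}

    def remove_effect(self, R):
        self.E -= R  # removed triples are discarded immediately

    def merge(self, other):
        M = self.E & other.E
        # Kept locally unless the other replica has already seen (and dropped) it.
        M1 = {(e, c, i) for (e, c, i) in self.E - other.E if c > other.v[i]}
        M2 = {(e, c, i) for (e, c, i) in other.E - self.E if c > self.v[i]}
        U = M | M1 | M2
        O = {(e, c, i) for (e, c, i) in U
             if any(e2 == e and i2 == i and c < c2 for (e2, c2, i2) in U)}
        self.E = U - O
        self.v = [max(x, y) for x, y in zip(self.v, other.v)]

a, b = OptORSet(0, 2), OptORSet(1, 2)
a.add_effect(a.add_prepare("e"))
b.merge(a)                                # b learns ("e", 1, 0)
a.remove_effect(a.remove_prepare("e"))    # concurrently, a removes e ...
b.add_effect(b.add_prepare("e"))          # ... while b adds e again
a.merge(b); b.merge(a)
print(a.E == b.E, a.contains("e"), len(a.E))  # True True 1
```

During the merge, the triple ("e", 1, 0) is absent from a's payload yet covered by a's vector, so it is recognized as removed without any tombstone, while the concurrent add ("e", 1, 1) survives.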

This approach can be generalized to any CRDT where elements are added and removed, 
e.g., a sequence [SHU] or a graph [TO] . 

Coalescing repeated adds: Another source of memory growth in the original OR-Set 
is elements that are added several times. Similarly to tombstones, they pollute the state 
with unique identifiers for every add. We observe that for every combination of element 
and source replica, it is enough to keep the identifier of the latest add, which subsumes 
previously added elements. The Opt-OR-Set specification leverages this observation in the add 
and merge definitions, by discarding unnecessary identifiers (set O). 

Space complexity: The payload size of Opt-OR-Set is bounded by $O(|elements| \cdot n + n)$ 
at any moment, where n is the number of processes in the system and $|elements|$ is the 
number of elements present in the set. The first component corresponds to the maximum 
number of timestamps in set E and the second captures the size of the vector v. In the 
common case, where the number of processes repeatedly invoking adds can be considered a 
constant, the payload size is $O(|elements| + n)$. 
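As a worked instance of the bound (the figures are illustrative assumptions, not measurements), take a set holding 1000 elements replicated over n = 5 processes. The payload keeps at most one triple per combination of element and source replica, plus one vector entry per process:

```latex
\[
|E| + |v| \;\le\; |elements| \cdot n + n \;=\; 1000 \cdot 5 + 5 \;=\; 5005
\]
```

This holds however many add and remove operations have been applied, whereas the OR-Set payload is only bounded by the total number of applied adds.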



5 Conclusions 

Conflict-Free Replicated Data Types (CRDTs) allow a system to maintain multiple 
replicas of data that are updated without requiring synchronization while guaranteeing Strong 
Eventual Consistency. This allows, for example, a cloud infrastructure to maintain replicas 
of data in data centers spread over large geographical distances and still provide low access 
latency by choosing the data center closest to the client. 

In this paper we reviewed existing replicated set designs and contrasted them with the 
CRDT OR-Set design, under the principle of permutation equivalence. Having in mind 
that the base OR-Set favored simplicity at the expense of scalability, we introduced a new 
optimized design, Optimized OR-Set, that greatly improves its scalability and should favor 
efficient implementations of sets and other CRDTs that share the OR-Set design techniques. 

2 It is easy to extend this solution for updates delivered out of happens-before order by using instead a 
version vector with exceptions [I]. 